Fix config reloads not respecting group interval #3074
base: main
Conversation
Force-pushed dc9abf9 → c2fc518, then c2fc518 → f4d61d8 (Signed-off-by: Andre Branchizio <andre.branchizio@shopify.com>).
notify/notify.go (outdated)

```go
// If we haven't notified about the alert group before, notify right away
// unless we only have resolved alerts.
if entry == nil {
	return len(firing) > 0
}

if !entry.IsFiringSubset(firing) {
	groupIntervalMuted := len(entry.FiringAlerts) > 0 && entry.Timestamp.After(time.Now().Add(-groupInterval))
```
Any chance we want to use keyNow from the context here?
This is arguably a feature, because sending resolved notifications immediately might exacerbate flapping notifications.
Agreed, but would it be better to make it work like that by default, with the ability to configure it?
Force-pushed 373cc9e → 320cb60, then 320cb60 → c61f146 (Signed-off-by: Andre Branchizio <andre.branchizio@shopify.com>).
In #3283 I try to solve another problem in a similar way. There are some differences, though:
I have modified this quite a bit in Shopify's Alertmanager fork: I no longer use the 0.99 constant, and I also had to solve alert flapping with a hard-coded check for whether the entry is recent. The flappy behavior was amplified by how we run Alertmanager, because we change https://github.com/prometheus/alertmanager/blob/main/dispatch/dispatch.go#L452. For what it's worth, here is the needsUpdate in our forked version:
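The fork's exact code isn't shown above, so here is a minimal sketch of what such a needsUpdate could look like, reconstructed from the diff earlier in this thread. The groupInterval parameter and the free-function form are assumptions; upstream's needsUpdate is a method on DedupStage and takes only the repeat interval:

```go
package notify

import (
	"time"

	"github.com/prometheus/alertmanager/nflog/nflogpb"
)

// needsUpdate sketch: reports whether a notification should be sent, given
// the last notification-log entry. The groupInterval muting is the behavior
// this PR adds; the recent-entry check stands in for the fork's hard-coded
// flapping guard described above, not its exact code.
func needsUpdate(entry *nflogpb.Entry, firing, resolved map[uint64]struct{}, repeat, groupInterval time.Duration) bool {
	// If we haven't notified about the alert group before, notify right away
	// unless we only have resolved alerts.
	if entry == nil {
		return len(firing) > 0
	}

	// The current firing set contains alerts we have not notified about yet.
	// Mute the immediate notification if the last one was within groupInterval.
	if !entry.IsFiringSubset(firing) {
		groupIntervalMuted := len(entry.FiringAlerts) > 0 &&
			entry.Timestamp.After(time.Now().Add(-groupInterval))
		return !groupIntervalMuted
	}

	// Notify about all alerts being resolved.
	if len(firing) == 0 {
		return len(entry.FiringAlerts) > 0
	}

	if !entry.IsResolvedSubset(resolved) {
		return true
	}

	// Nothing changed; only notify if the repeat interval has passed.
	return entry.Timestamp.Before(time.Now().Add(-repeat))
}
```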
Interesting, in your code you use time.Now(), which is a wall clock, but in this PR you use the now from the context.
Add an acceptance test that triggers a config reload and verifies that no early notification occurs.

One of the challenges is which time to use to check for a previous notification. The nflog captures roughly the time all notifications were sent. That conflicts with the ag.next timer, which gets reset before the ag is flushed. Delays and retries can make these two points in time differ enough for the integration tests to fail. I considered the following ways to fix it:

1.) Modify the nflog.proto to capture the flush time in addition to the successful notification time.
2.) In addition to hasFlushed, capture the last flush time and pass it to the context like we do for Now.
3.) Set hasFlushed and reset the timer after the flush has been done.

I started with option 3 as it seemed to have the fewest downsides with things like drift.

needsUpdate is based on: prometheus#3074 (comment)

Signed-off-by: Holger Hans Peter Freyther <holger@freyther.de>
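As an aside, option 2 would mirror how Alertmanager already threads Now through the notification context. A minimal sketch, assuming hypothetical WithFlushTime/FlushTime helpers (the WithNow/Now pair exists upstream; these do not):

```go
package notify

import (
	"context"
	"time"
)

type contextKey int

const keyFlushTime contextKey = iota

// WithFlushTime returns a context carrying the time of the last flush.
// Hypothetical helper mirroring upstream's WithNow/Now pair.
func WithFlushTime(ctx context.Context, t time.Time) context.Context {
	return context.WithValue(ctx, keyFlushTime, t)
}

// FlushTime extracts the last flush time, reporting whether it was set.
func FlushTime(ctx context.Context) (time.Time, bool) {
	t, ok := ctx.Value(keyFlushTime).(time.Time)
	return t, ok
}
```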
Hi! 👋 I found this issue from another PR.
Config reload will re-create all existing aggregation groups. This causes an immediate flush because of this code here:

```go
// Immediately trigger a flush if the wait duration for this
// alert is already over.
ag.mtx.Lock()
defer ag.mtx.Unlock()
if !ag.hasFlushed && alert.StartsAt.Add(ag.opts.GroupWait).Before(time.Now()) {
	ag.next.Reset(0)
}
```

I think we should wait group_wait seconds. I'm not sure if we need to persist and wait the remainder of the previous group_interval?
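A minimal sketch of that suggestion, assuming we simply reset the timer to the full GroupWait instead of zero (illustration only, not the change in this PR):

```go
// Hypothetical variant: after a reload re-creates the aggregation group,
// wait GroupWait again rather than flushing immediately for alerts whose
// original GroupWait has already elapsed.
ag.mtx.Lock()
defer ag.mtx.Unlock()
if !ag.hasFlushed && alert.StartsAt.Add(ag.opts.GroupWait).Before(time.Now()) {
	ag.next.Reset(ag.opts.GroupWait) // was: ag.next.Reset(0)
}
```

Waiting the remainder of the previous group_interval would need extra state to survive the reload, which is roughly what the nflog-based approaches discussed in this thread explore.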
Like Simon said, I think this is intended behavior. If you have a group of size 1 (i.e. grouping is disabled) that is flapping between firing and resolved, won't this change make Alertmanager send a notification each time it flaps, rather than smooth it out?
Add an acceptance test that triggers a config reload and verifies that no early notification occurs.

One of the challenges is which time to use to check for a previous notification. The nflog captures roughly the time all notifications were sent. That conflicts with the ag.next timer, which gets reset before the ag is flushed. Delays and retries can make these two points in time differ enough for the integration tests to fail. I considered the following ways to fix it:

1.) Modify the nflog.proto to capture the flush time in addition to the successful notification time.
2.) In addition to hasFlushed, capture the last flush time and pass it to the context like we do for Now.
3.) Set hasFlushed and reset the timer after the flush has been done.

I started with option 3 as it seemed to have the fewest downsides with things like drift. Based on the comments, this is now option 1.

needsUpdate is based on: prometheus#3074 (comment)

Signed-off-by: Holger Hans Peter Freyther <holger@freyther.de>
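To make the muting behavior concrete, here is a small unit-test sketch against the needsUpdate sketch from earlier in this thread (same assumed signature and package; not the acceptance test described in the commit message):

```go
package notify

import (
	"testing"
	"time"

	"github.com/prometheus/alertmanager/nflog/nflogpb"
)

// A new alert joins the group one minute after the last notification; with a
// five-minute group interval, the immediate re-notification should be muted.
func TestNeedsUpdateMutedWithinGroupInterval(t *testing.T) {
	groupInterval := 5 * time.Minute

	entry := &nflogpb.Entry{
		FiringAlerts: []uint64{1},
		Timestamp:    time.Now().Add(-time.Minute), // last notified 1m ago
	}

	// Alert 2 is new, so the current firing set is not covered by the entry.
	firing := map[uint64]struct{}{1: {}, 2: {}}

	if needsUpdate(entry, firing, nil, time.Hour, groupInterval) {
		t.Fatal("expected notification to be muted within the group interval")
	}
}
```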
We have been running this modification at Shopify for the last few weeks and it has solved two issues for us.
Related issues that this addresses:
This change does the following:
TODO:
This change adds even further to the confusion between group_interval and group_wait. Three separate configs seem more appropriate: group_wait (how long we wait to start the evaluation background job for an alert group), group_interval (how frequently an alert group is evaluated for dispatch), and group_repeat_interval (how long to wait before sending a notification about new alerts that are added to a group of alerts). This would allow finer tuning of alert dispatching.
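For illustration, the three proposed knobs might hang off the aggregation-group options roughly like this. group_repeat_interval does not exist in upstream Alertmanager; this is a hypothetical sketch of the proposal, not its implementation:

```go
package dispatch

import "time"

// GroupTimingOpts is a hypothetical options struct for the three-knob
// proposal above; upstream only has GroupWait, GroupInterval, and
// RepeatInterval.
type GroupTimingOpts struct {
	// GroupWait is how long to wait before the first flush of a new group.
	GroupWait time.Duration
	// GroupInterval is how frequently the group is evaluated for dispatch.
	GroupInterval time.Duration
	// GroupRepeatInterval is how long to wait before notifying about new
	// alerts added to an existing group.
	GroupRepeatInterval time.Duration
}
```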